Introduction


Customer Personality Analysis helps a business better understand its customers, making it easier to modify existing products or create new ones, and supporting decision making.


For example: whether to invest money in marketing a new product, how much to invest, and who the target customers are.


The core of a customer personality analysis is getting answers to questions about who the customers are and what they need.

Goal

This project aims to apply unsupervised machine learning techniques to identify customer segments. It works through the data-mining stages of data understanding, data preparation, modeling, evaluation, and deployment.

About the Dataset

The dataset used in this analysis was obtained from Kaggle’s Customer Personality Analysis exercise. Its variables are grouped as follows:

  • People
    • ID: Customer’s unique identifier
    • Year_Birth: Customer’s birth year
    • Education: Customer’s education level
    • Marital_Status: Customer’s marital status
    • Income: Customer’s yearly household income
    • Kidhome: Number of children in customer’s household
    • Teenhome: Number of teenagers in customer’s household
    • Dt_Customer: Date of customer’s enrollment with the company
    • Recency: Number of days since customer’s last purchase
    • Complain: 1 if the customer complained in the last 2 years, 0 otherwise
  • Products
    • MntWines: Amount spent on wine in last 2 years
    • MntFruits: Amount spent on fruits in last 2 years
    • MntMeatProducts: Amount spent on meat in last 2 years
    • MntFishProducts: Amount spent on fish in last 2 years
    • MntSweetProducts: Amount spent on sweets in last 2 years
    • MntGoldProds: Amount spent on gold in last 2 years
  • Promotion
    • NumDealsPurchases: Number of purchases made with a discount
    • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
    • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
    • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
    • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
    • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
    • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • Place
    • NumWebPurchases: Number of purchases made through the company’s website
    • NumCatalogPurchases: Number of purchases made using a catalogue
    • NumStorePurchases: Number of purchases made directly in stores
    • NumWebVisitsMonth: Number of visits to company’s website in the last month

Load Data and Import libraries

# Import libraries used throughout the analysis
library(tidyverse)   # dplyr, tidyr, ggplot2, stringr
library(plotly)      # ggplotly()
library(reshape2)    # melt()
library(corrplot)    # corrplot()
library(psych)       # pairs.panels()
library(fastDummies) # dummy_cols()

# Load data
market_campaign <- as.data.frame(read.csv("marketing_campaign.csv"))

Data preparation

Knowing my Data

glimpse(market_campaign)
## Rows: 2,240
## Columns: 1
## $ ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMonth.AcceptedCmp3.AcceptedCmp4.AcceptedCmp5.AcceptedCmp1.AcceptedCmp2.Complain.Z_CostContact.Z_Revenue.Response <chr> …


From the output, it’s time to panic a bit: the data is all merged into a single column. But worry not, with the power of the tidyverse we will undo that tangled mess. Prepare to witness the magic of untangling data chaos and creating order, one spell at a time!


So, let’s start our counter-spell 🪄


As we can observe, the data is stored as `chr` type, so we are dealing with data that was stored as text. The patterns are:

  • column names are separated by “.”
  • observations use a tab (“\t”) as separator, as we can see below

head(market_campaign, n = 5)
##   ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMo ...
## 1                                                                                                                             5524\t1957\tGraduation\tSingle\t58138\t0\t0\t04-09-2012\t58\t635\t88\t546\t172\t88\t88\t3\t8\t10\t4\t7\t0\t0\t0\t0\t0\t0\t3\t11\t1
## 2                                                                                                                                      2174\t1954\tGraduation\tSingle\t46344\t1\t1\t08-03-2014\t38\t11\t1\t6\t2\t1\t6\t2\t1\t1\t2\t5\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 3                                                                                                                           4141\t1965\tGraduation\tTogether\t71613\t0\t0\t21-08-2013\t26\t426\t49\t127\t111\t21\t42\t1\t8\t2\t10\t4\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 4                                                                                                                                  6182\t1984\tGraduation\tTogether\t26646\t1\t0\t10-02-2014\t26\t11\t4\t20\t10\t3\t5\t2\t2\t0\t4\t6\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 5                                                                                                                                     5324\t1981\tPhD\tMarried\t58293\t1\t0\t19-01-2014\t94\t173\t43\t118\t46\t27\t15\t5\t5\t3\t6\t5\t0\t0\t0\t0\t0\t0\t3\t11\t0


So let’s write the steps for this process in our Grimoire:

  • Split the data into separate columns, using the tab (“\\t”) regex as the separator
  • Convert each variable to its proper type

# Split the data in different columns
market_campaign <- market_campaign %>% separate (ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMonth.AcceptedCmp3.AcceptedCmp4.AcceptedCmp5.AcceptedCmp1.AcceptedCmp2.Complain.Z_CostContact.Z_Revenue.Response, into = c("ID","Year_Birth","Education","Marital_Status","Income","Kidhome","Teenhome","Dt_Customer","Recency", "MntWines","MntFruits",  "MntMeatProducts" , "MntFishProducts" ,  "MntSweetProducts","MntGoldProds", "NumDealsPurchases"  ,  "NumWebPurchases"     , "NumCatalogPurchases"  , "NumStorePurchases", "NumWebVisitsMonth"      ,  "AcceptedCmp3" ,  "AcceptedCmp4", "AcceptedCmp5" , "AcceptedCmp1","AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue",  "Response" ), sep = "\\t")
  
glimpse(market_campaign)
## Rows: 2,240
## Columns: 29
## $ ID                  <chr> "5524", "2174", "4141", "6182", "5324", "7446", "9…
## $ Year_Birth          <chr> "1957", "1954", "1965", "1984", "1981", "1967", "1…
## $ Education           <chr> "Graduation", "Graduation", "Graduation", "Graduat…
## $ Marital_Status      <chr> "Single", "Single", "Together", "Together", "Marri…
## $ Income              <chr> "58138", "46344", "71613", "26646", "58293", "6251…
## $ Kidhome             <chr> "0", "1", "0", "1", "1", "0", "0", "1", "1", "1", …
## $ Teenhome            <chr> "0", "1", "0", "0", "0", "1", "1", "0", "0", "1", …
## $ Dt_Customer         <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2…
## $ Recency             <chr> "58", "38", "26", "26", "94", "16", "34", "32", "1…
## $ MntWines            <chr> "635", "11", "426", "11", "173", "520", "235", "76…
## $ MntFruits           <chr> "88", "1", "49", "4", "43", "42", "65", "10", "0",…
## $ MntMeatProducts     <chr> "546", "6", "127", "20", "118", "98", "164", "56",…
## $ MntFishProducts     <chr> "172", "2", "111", "10", "46", "0", "50", "3", "3"…
## $ MntSweetProducts    <chr> "88", "1", "21", "3", "27", "42", "49", "1", "3", …
## $ MntGoldProds        <chr> "88", "6", "42", "5", "15", "14", "27", "23", "2",…
## $ NumDealsPurchases   <chr> "3", "2", "1", "2", "5", "2", "4", "2", "1", "1", …
## $ NumWebPurchases     <chr> "8", "1", "8", "2", "5", "6", "7", "4", "3", "1", …
## $ NumCatalogPurchases <chr> "10", "1", "2", "0", "3", "4", "3", "0", "0", "0",…
## $ NumStorePurchases   <chr> "4", "2", "10", "4", "6", "10", "7", "4", "2", "0"…
## $ NumWebVisitsMonth   <chr> "7", "5", "4", "6", "5", "6", "6", "8", "9", "20",…
## $ AcceptedCmp3        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", …
## $ AcceptedCmp4        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp5        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp1        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp2        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ Complain            <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ Z_CostContact       <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", …
## $ Z_Revenue           <chr> "11", "11", "11", "11", "11", "11", "11", "11", "1…
## $ Response            <chr> "1", "0", "0", "0", "0", "0", "0", "0", "1", "0", …


Note that all data is stored as `chr`; the next step will be to convert each column to its proper type: numeric, factor, or date.

  • cols 2, 5:20, 27 and 28 will be converted to numeric
  • cols 21:26 and 29 are dummies; they should be stored as factors to avoid errors
  • cols 1, 3 and 4 will remain as text
  • Dt_Customer must be converted to date type
# Convert var type
market_campaign <- market_campaign %>% mutate_at(vars(Year_Birth, Income, Kidhome, Teenhome, Recency,
                                                      MntWines,MntFruits,MntMeatProducts,
                                                      MntFishProducts, MntSweetProducts,MntGoldProds, NumDealsPurchases,
                                                      NumWebPurchases, NumWebVisitsMonth, NumCatalogPurchases,
                                                      NumStorePurchases, Z_CostContact, Z_Revenue, AcceptedCmp3, AcceptedCmp4,
                                                      AcceptedCmp5, AcceptedCmp1, AcceptedCmp2), 
                                                 as.numeric)

market_campaign$Dt_Customer  <- as.Date(market_campaign$Dt_Customer,  format = "%d-%m-%Y")
summary(market_campaign)
##       ID              Year_Birth    Education         Marital_Status    
##  Length:2240        Min.   :1893   Length:2240        Length:2240       
##  Class :character   1st Qu.:1959   Class :character   Class :character  
##  Mode  :character   Median :1970   Mode  :character   Mode  :character  
##                     Mean   :1969                                        
##                     3rd Qu.:1977                                        
##                     Max.   :1996                                        
##                                                                         
##      Income          Kidhome          Teenhome       Dt_Customer        
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Min.   :2012-07-30  
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2013-01-16  
##  Median : 51382   Median :0.0000   Median :0.0000   Median :2013-07-08  
##  Mean   : 52247   Mean   :0.4442   Mean   :0.5062   Mean   :2013-07-10  
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2013-12-30  
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000   Max.   :2014-06-29  
##  NA's   :24                                                             
##     Recency         MntWines         MntFruits     MntMeatProducts 
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16.0  
##  Median :49.00   Median : 173.50   Median :  8.0   Median :  67.0  
##  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 166.9  
##  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232.0  
##  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725.0  
##                                                                    
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
##                                                                      
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
##  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
##                                                                          
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   AcceptedCmp2       Complain         Z_CostContact   Z_Revenue 
##  Min.   :0.00000   Length:2240        Min.   :3     Min.   :11  
##  1st Qu.:0.00000   Class :character   1st Qu.:3     1st Qu.:11  
##  Median :0.00000   Mode  :character   Median :3     Median :11  
##  Mean   :0.01339                      Mean   :3     Mean   :11  
##  3rd Qu.:0.00000                      3rd Qu.:3     3rd Qu.:11  
##  Max.   :1.00000                      Max.   :3     Max.   :11  
##                                                                 
##    Response        
##  Length:2240       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 


At this point, 24 NAs and an outlier (666666) were identified in Income. The NA rows will simply be removed; the total number of data points after removing the rows with missing values is 2216.

market_campaign <- market_campaign %>% 
                        filter(!is.na(Income))

Calculating Additional Features


From Dt_Customer we can find:

  • The newest customer’s enrollment date in the records: 2014-06-29
  • The oldest customer’s enrollment date in the records: 2012-07-30


It is interesting to count the days the customer has been on our purchase list; we will do that by creating a new feature, Days_of_register.

d1 = max(market_campaign$Dt_Customer) #taking it to be the newest customer
Days_of_register <- difftime(d1, market_campaign$Dt_Customer, units = "days")
market_campaign$Days_of_register <- as.numeric(Days_of_register)
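As a quick, self-contained sanity check of the `difftime()` logic (toy dates, not the project data):

```r
# Toy check: days between an enrollment date and the newest date on record
d1 <- as.Date("2014-06-29")
enrolled <- as.Date(c("2014-06-29", "2013-06-29"))
days <- as.numeric(difftime(d1, enrolled, units = "days"))
days  # 0 for the newest customer, 365 for one enrolled a year earlier
```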


About consumers, it would be interesting to have data regarding:

  • Age: extracted by subtracting Year_Birth from the year of Dt_Customer
  • Total_Spent: amount spent by the customer across the product categories over the span of two years
  • Living_With: derived from Marital_Status to extract the living situation of couples
  • Children_Count: sum of the numbers of kids and teenagers
  • Family_Size: total household size
  • Is_Parent: parenthood status, 0 or 1

# Age calculation
data_collect_year <- format(market_campaign$Dt_Customer, "%Y") # converting to the same format
market_campaign$age <- as.integer(data_collect_year) - market_campaign$Year_Birth

# Total_Spent
market_campaign$Total_Spent <- as.integer (market_campaign$MntWines + market_campaign$MntFruits+ 
                                   market_campaign$MntMeatProducts +
                                   market_campaign$MntFishProducts +
                                   market_campaign$MntSweetProducts +
                                   market_campaign$MntGoldProds)

# Living_With
market_campaign$Living_With <- market_campaign$Marital_Status
market_campaign <- market_campaign %>%
  mutate(Living_With = case_when(
    Marital_Status %in% c("Married", "Together") ~ "Partner",
    Marital_Status %in% c("Absurd", "Widow", "YOLO", "Divorced", "Single") ~ "Alone",
    TRUE ~ ""
  ))

# Children_Count
market_campaign$Children_Count <- as.numeric(market_campaign$Kidhome + market_campaign$Teenhome)

# Family_Size

market_campaign$Family_Size <- as.numeric(market_campaign$Children_Count + 1) # count without partner

market_campaign <- market_campaign %>%
  mutate(Family_Size = case_when(
    Living_With == "Alone"   ~ Family_Size,      # no partner to add
    Living_With == "Partner" ~ Family_Size + 1,  # add the partner
    TRUE                     ~ Family_Size       # retain Family_Size for other cases
  ))

# Is_Parent

market_campaign$Is_Parent <- if_else(market_campaign$Children_Count > 0, 1, 0)


From purchase habits, it is interesting to have the number of promotions accepted by each client.

market_campaign$Total_AcceptedCmp <- as.integer(market_campaign$AcceptedCmp3+
                                                  market_campaign$AcceptedCmp4+
                                                  market_campaign$AcceptedCmp5+
                                                  market_campaign$AcceptedCmp1+
                                                  market_campaign$AcceptedCmp2)


For better visualization, I will rename the columns that contain a product type.

# Renaming columns
# Renaming columns: strip the "Mnt" prefix and the "Products"/"Prods" suffix
prod_names <- c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts",
                "MntSweetProducts", "MntGoldProds")
new_prod_names <- gsub("Products|Prods|Mnt", "", prod_names)

colnames(market_campaign)[10:15] <- new_prod_names
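A quick self-contained check that the `gsub()` pattern really produces the short product names (applied here to a local copy of the six Mnt* names):

```r
# The same regex as above, applied to a standalone vector of the column names
prod_names <- c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts",
                "MntSweetProducts", "MntGoldProds")
short_names <- gsub("Products|Prods|Mnt", "", prod_names)
short_names  # "Wines" "Fruits" "Meat" "Fish" "Sweet" "Gold"
```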

All good! Data prepared and ready for the exploratory analysis. Time to dive into Middle-earth and talk to the Statistics Wizard!

Exploratory Analysis

Frequency Table

freq_tab_ed <- table(market_campaign$Education)
print(freq_tab_ed)
## 
##   2n Cycle      Basic Graduation     Master        PhD 
##        200         54       1116        365        481
freq_tab_marstat <- table(market_campaign$Marital_Status)
print(freq_tab_marstat)
## 
##   Absurd    Alone Divorced  Married   Single Together    Widow     YOLO 
##        2        3      232      857      471      573       76        2

Descriptive Stats

summary(market_campaign)
##       ID              Year_Birth    Education         Marital_Status    
##  Length:2216        Min.   :1893   Length:2216        Length:2216       
##  Class :character   1st Qu.:1959   Class :character   Class :character  
##  Mode  :character   Median :1970   Mode  :character   Mode  :character  
##                     Mean   :1969                                        
##                     3rd Qu.:1977                                        
##                     Max.   :1996                                        
##      Income          Kidhome          Teenhome       Dt_Customer        
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Min.   :2012-07-30  
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2013-01-16  
##  Median : 51382   Median :0.0000   Median :0.0000   Median :2013-07-08  
##  Mean   : 52247   Mean   :0.4418   Mean   :0.5054   Mean   :2013-07-10  
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2013-12-31  
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000   Max.   :2014-06-29  
##     Recency          Wines            Fruits            Meat       
##  Min.   : 0.00   Min.   :   0.0   Min.   :  0.00   Min.   :   0.0  
##  1st Qu.:24.00   1st Qu.:  24.0   1st Qu.:  2.00   1st Qu.:  16.0  
##  Median :49.00   Median : 174.5   Median :  8.00   Median :  68.0  
##  Mean   :49.01   Mean   : 305.1   Mean   : 26.36   Mean   : 167.0  
##  3rd Qu.:74.00   3rd Qu.: 505.0   3rd Qu.: 33.00   3rd Qu.: 232.2  
##  Max.   :99.00   Max.   :1493.0   Max.   :199.00   Max.   :1725.0  
##       Fish            Sweet             Gold        NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.50   Median : 2.000   
##  Mean   : 37.64   Mean   : 27.03   Mean   : 43.97   Mean   : 2.324   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :262.00   Max.   :321.00   Max.   :15.000   
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.000    Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.000    1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.000    Median : 6.000   
##  Mean   : 4.085   Mean   : 2.671      Mean   : 5.801    Mean   : 5.319   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.000    3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.000    Max.   :20.000   
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5     AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.07356   Mean   :0.07401   Mean   :0.0731   Mean   :0.06408  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##   AcceptedCmp2       Complain         Z_CostContact   Z_Revenue 
##  Min.   :0.00000   Length:2216        Min.   :3     Min.   :11  
##  1st Qu.:0.00000   Class :character   1st Qu.:3     1st Qu.:11  
##  Median :0.00000   Mode  :character   Median :3     Median :11  
##  Mean   :0.01354                      Mean   :3     Mean   :11  
##  3rd Qu.:0.00000                      3rd Qu.:3     3rd Qu.:11  
##  Max.   :1.00000                      Max.   :3     Max.   :11  
##    Response         Days_of_register      age          Total_Spent    
##  Length:2216        Min.   :  0.0    Min.   : 16.00   Min.   :   5.0  
##  Class :character   1st Qu.:180.0    1st Qu.: 36.00   1st Qu.:  69.0  
##  Mode  :character   Median :355.5    Median : 43.00   Median : 396.5  
##                     Mean   :353.5    Mean   : 44.21   Mean   : 607.1  
##                     3rd Qu.:529.0    3rd Qu.: 54.00   3rd Qu.:1048.0  
##                     Max.   :699.0    Max.   :121.00   Max.   :2525.0  
##  Living_With        Children_Count    Family_Size      Is_Parent     
##  Length:2216        Min.   :0.0000   Min.   :1.000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000  
##  Mode  :character   Median :1.0000   Median :3.000   Median :1.0000  
##                     Mean   :0.9472   Mean   :2.593   Mean   :0.7144  
##                     3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.0000  
##                     Max.   :3.0000   Max.   :5.000   Max.   :1.0000  
##  Total_AcceptedCmp
##  Min.   :0.0000   
##  1st Qu.:0.0000   
##  Median :0.0000   
##  Mean   :0.2983   
##  3rd Qu.:0.0000   
##  Max.   :4.0000


Note that at this point some outliers were identified: the Income value of 666666 is very high, and so are the two ages of 121. In both cases it is not possible to determine whether the numbers were real input or errors during data collection, and these outliers have a big impact on the mean. Before deciding how to deal with them, let’s plot a correlation graphic and boxplots.

#Boxplot for the first 10 variables

Outliers <- c(colnames(market_campaign[,c(5:7,9:15)]))
ggplotly(
  market_campaign[,c(5:7,9:15)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables
#Boxplot for variables 16-20

Outliers <- c(colnames(market_campaign[,c(16:20)]))
ggplotly(
  market_campaign[,c(16:20)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables
ggplotly(
  market_campaign[,c(20,31,34,35)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables

ggplotly(
  market_campaign[,c(30,32)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables


Clearly, there are a few outliers in the Income and Age features. In this project they will be managed by deleting the outlier rows.

market_campaign <- market_campaign %>% filter(Income < 600000 & age < 114)
print(paste0("The total number of data-points after removing the outliers are: ", nrow(market_campaign)))
## [1] "The total number of data-points after removing the outliers are: 2213"
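Note that the two bounds must be combined with `&` (a row is kept only when both hold); with `|`, almost every row survives because each outlier row still satisfies the other condition. A toy illustration with made-up rows:

```r
# Toy illustration of '&' vs '|' when filtering out outliers
df <- data.frame(Income = c(50000, 666666, 40000), age = c(40, 45, 121))

kept_and <- subset(df, Income < 600000 & age < 114)  # drops both outlier rows
kept_or  <- subset(df, Income < 600000 | age < 114)  # every row passes at least one test

nrow(kept_and)  # 1
nrow(kept_or)   # 3
```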

Checking for Correlations


Evans (1996) suggests for the absolute value of r:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”
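As a small illustration, Evans’ scale can be wrapped in a hypothetical helper (the function name and the half-open bin edges are my own choices, not part of the original scale):

```r
# Hypothetical helper mapping |r| to Evans' (1996) verbal labels
evans_label <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      include.lowest = TRUE)
}

as.character(evans_label(0.65))   # "strong"
as.character(evans_label(-0.85))  # "very strong"
```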
# Correlation matrix

matrix_1 <- market_campaign[, c(5, 9:20, 30:32, 34,35)]
cor_matrix <- cor(matrix_1)

# Plot the correlation matrix
corrplot(cor_matrix, method = "color")

matrix_2 <- market_campaign[, c(5, 9:16)]
pairs.panels(matrix_2)

matrix_3 <- market_campaign[, c( 17:20 ,30:32, 34,35)]
pairs.panels(matrix_3)


From the correlation matrix it is possible to note:

  • strong positive correlation:
    • between Income and Total_Spent, NumCatalogPurchases, Meat and Wines
    • between Total_Spent and NumCatalogPurchases, NumStorePurchases, Family_Size, Is_Parent and Children_Count
  • very strong positive correlation:
    • between Total_Spent and Wines, Fish, NumCatalogPurchases and Meat
  • moderate positive correlation between:
    • Fruits, Meat, Fish and Sweet
    • Income and all food products
    • NumWebPurchases, NumStorePurchases and Total_Spent
    • NumCatalogPurchases and NumStorePurchases
    • NumWebVisitsMonth and Children_Count
  • moderate negative correlation between:
    • NumCatalogPurchases and NumWebVisitsMonth
    • NumStorePurchases and NumWebVisitsMonth
    • NumWebVisitsMonth and Total_Spent
    • Children_Count, NumCatalogPurchases and Total_Spent
    • Family_Size and Total_Spent

Categorical Encoding


Categorical encoding is the process of converting categorical data to numerical data so that a machine learning algorithm can work with it. It simply converts categories to numbers. The two most widely used techniques are:

  • Label Encoding: a unique integer or alphabetical ordering represents each label.
  • One-Hot Encoding: is the process of creating dummy variables.
Table One-Hot vs. Label Encoding

We apply One-Hot Encoding when:

The categorical feature is not ordinal, and the number of categorical features is not big. In one-hot encoding each category is mutually exclusive. For example, “Red” may be encoded as [1, 0, 0], “Green” as [0, 1, 0], and “Blue” as [0, 0, 1].

We apply Label Encoding when:

The categorical feature is ordinal (like Jr. kg, Sr. kg, primary school, high school), or the number of categories is quite large, since one-hot encoding can then lead to high memory consumption.

It preserves the ordinal relationship between categories if present. For example, “Red” may be encoded as 1, “Green” as 2, and “Blue” as 3.

We apply Ordinal Encoding:

Ordinal encoding is similar to label encoding but considers the order or rank of categories. For example, “Ocean” may be encoded as 1, “Sea” as 2, and “Coast” as 3.

Although label encoding is space-efficient, it may introduce an arbitrary order to categorical values. One-hot encoding avoids this issue by creating binary columns for each category, but it can lead to high-dimensional data.
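A minimal, self-contained sketch of the two techniques in base R (toy color data, not the project dataset):

```r
colors <- factor(c("Red", "Green", "Blue", "Green"))

# Label encoding: factor levels are alphabetical, so Blue = 1, Green = 2, Red = 3
label_encoded <- as.numeric(colors)

# One-hot encoding: one binary column per level (the '-1' drops the intercept)
one_hot <- model.matrix(~ colors - 1)
```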

For the categorical variables in this project, it is possible to use ordinal encoding for Education and label encoding for Marital_Status and Living_With. Or, since there is no need to preserve a categorical rank, we can simply apply label encoding to all variables except Living_With, which can be handled by one-hot encoding.

Thus, label encoding will be applied to Education and Marital_Status, and one-hot encoding to Living_With.

# Changing Education for numeric label
market_campaign$Education <- as.numeric(factor(market_campaign$Education))

# Changing Marital_Status for numeric label
market_campaign$Marital_Status <- as.numeric(factor(market_campaign$Marital_Status))

market_campaign <- dummy_cols(market_campaign, select_columns = "Living_With", remove_selected_columns = TRUE)
market_campaign <- market_campaign[, -c(37,38)]

market_campaign <- market_campaign %>% mutate_at(vars(Complain, Response, Living_With_Partner), as.factor)


Note that the Education categories, following the alphabetical order of the factor levels, are: 2n Cycle: 1, Basic: 2, Graduation: 3, Master: 4, PhD: 5.


The labels for Marital_Status: Absurd: 1, Alone: 2, Divorced: 3, Married: 4, Single: 5, Together: 6, Widow: 7, YOLO: 8.


A new variable called Living_With_Partner was created, where Yes: 1 and No: 0.

Removing Unnecessary Columns

market_campaign <- market_campaign[, -c(1,2,4,6:8,21:25,27,28)]
head(market_campaign)
##   Education Income Recency Wines Fruits Meat Fish Sweet Gold NumDealsPurchases
## 1         3  58138      58   635     88  546  172    88   88                 3
## 2         3  46344      38    11      1    6    2     1    6                 2
## 3         3  71613      26   426     49  127  111    21   42                 1
## 4         3  26646      26    11      4   20   10     3    5                 2
## 5         5  58293      94   173     43  118   46    27   15                 5
## 6         4  62513      16   520     42   98    0    42   14                 2
##   NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1               8                  10                 4                 7
## 2               1                   1                 2                 5
## 3               8                   2                10                 4
## 4               2                   0                 4                 6
## 5               5                   3                 6                 5
## 6               6                   4                10                 6
##   Complain Response Days_of_register age Total_Spent Children_Count Family_Size
## 1        0        1              663  55        1617              0           1
## 2        0        0              113  60          27              2           3
## 3        0        0              312  48         776              0           2
## 4        0        0              139  30          53              1           3
## 5        0        0              161  33         422              1           3
## 6        0        0              293  46         716              1           3
##   Is_Parent Total_AcceptedCmp Living_With_Partner
## 1         0                 0                   0
## 2         1                 0                   0
## 3         0                 0                   1
## 4         1                 0                   1
## 5         1                 0                   1
## 6         1                 0                   1

Dimensional Reduction


After cleaning all unnecessary data we are still left with 25 variables, many of them correlated. As that song from Depeche Mode would say, “It’s no good!”; but instead of “waiting patiently” for the data to get smaller by itself, we can perform dimensionality reduction.


Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. And it is here that the magic of modeling starts, with PCA (Principal Component Analysis).

Steps to run the PCA:

  1. Standardize the d-dimensional dataset.
  2. Check whether PCA is adequate using Bartlett’s Test of Sphericity.
  3. Construct the covariance matrix.
  4. Decompose the covariance matrix into its eigenvectors and eigenvalues.
  5. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
  6. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
  7. Construct a projection matrix, W, from the “top” k eigenvectors.
  8. Transform the d-dimensional input dataset using W to obtain the new k-dimensional feature subspace.
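The steps above can be sketched in base R. This is an illustrative sketch on a small synthetic matrix, not the campaign dataset (the data and dimensions here are made up):

```r
# Illustrative PCA sketch on a synthetic d-dimensional dataset
set.seed(42)
X <- matrix(rnorm(100 * 4), ncol = 4)   # 100 observations, d = 4

# 1. Standardize
X_std <- scale(X)

# 2. (Bartlett's test omitted in this sketch)
# 3. Covariance matrix of the standardized data
S <- cov(X_std)

# 4-5. Eigendecomposition; eigen() already sorts by decreasing eigenvalue
eig <- eigen(S)

# 6. Keep the k eigenvectors whose eigenvalues exceed 1 (Kaiser criterion)
k <- sum(eig$values > 1)
W <- eig$vectors[, 1:k, drop = FALSE]   # 7. projection matrix

# 8. Project onto the new k-dimensional subspace
scores <- X_std %*% W
dim(scores)                             # 100 rows, k columns
```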

Standardize data


Before starting, it is necessary to standardize the dataset.

# Standardize data
market_campaign <- market_campaign %>% mutate(across(everything(), as.numeric))
market_campaign_std <- market_campaign[, -1] %>% 
  mutate(across(everything(), scale)) %>% 
  as.matrix()

Bartlett’s Test of Sphericity


Before performing a data reduction technique such as principal component analysis or factor analysis, it is necessary to verify that the reduction can compress the data without losing meaningful variables.

Bartlett’s Test of Sphericity compares an observed correlation matrix to the identity matrix and checks whether there is redundancy between the variables.


The null hypothesis (H0) of the test is that the variables are orthogonal, i.e. not correlated. The alternative hypothesis (H1) is that the variables are correlated enough that the correlation matrix diverges significantly from the identity matrix.

        H0 : equal to identity matrix  (p-value > alpha)
        H1 : differs from the identity matrix (p-value < alpha)


Note: Bartlett’s test must be performed on the original data rather than on the standardized version.

# Bartlett's Test
correl_matrix <- cor(market_campaign)
correl_matrix
##                       Education       Income       Recency        Wines
## Education            1.00000000  0.120692359 -0.0114182027  0.197886295
## Income               0.12069236  1.000000000 -0.0039697555  0.578649750
## Recency             -0.01141820 -0.003969756  1.0000000000  0.015721019
## Wines                0.19788630  0.578649750  0.0157210194  1.000000000
## Fruits              -0.08246221  0.430841681 -0.0058437499  0.387023861
## Meat                 0.03996094  0.584633357  0.0225176351  0.568860003
## Fish                -0.11474748  0.438871359  0.0005509232  0.397721050
## Sweet               -0.10727879  0.440743792  0.0251097703  0.390325802
## Gold                -0.09708394  0.325916446  0.0176626377  0.392730993
## NumDealsPurchases    0.02620787 -0.083100896  0.0021154508  0.008885929
## NumWebPurchases      0.08242518  0.387877811 -0.0056408538  0.553785939
## NumCatalogPurchases  0.06904876  0.589162442  0.0240814076  0.634752741
## NumStorePurchases    0.06779203  0.529362140 -0.0004338266  0.640011908
## NumWebVisitsMonth   -0.04082162 -0.553088012 -0.0185636434 -0.321977901
## Complain            -0.05086296 -0.027224512  0.0136366703 -0.039470211
## Response             0.09080602  0.133046664 -0.1997663693  0.246298957
## Days_of_register    -0.04890128 -0.018530777  0.0259625870  0.168049485
## age                  0.17306509  0.162556855  0.0146998145  0.150105920
## Total_Spent          0.09406922  0.667576090  0.0200656546  0.893135723
## Children_Count       0.05473709 -0.293351925  0.0182900870 -0.353747647
## Family_Size          0.03757485 -0.240147877  0.0144021277 -0.296388165
## Is_Parent            0.02344136 -0.338153413  0.0024851549 -0.343094115
## Total_AcceptedCmp    0.03760444  0.308381090 -0.0136471809  0.510832559
## Living_With_Partner -0.01456363  0.004663398 -0.0013710486 -0.007243788
##                           Fruits        Meat          Fish       Sweet
## Education           -0.082462215  0.03996094 -0.1147474825 -0.10727879
## Income               0.430841681  0.58463336  0.4388713595  0.44074379
## Recency             -0.005843750  0.02251764  0.0005509232  0.02510977
## Wines                0.387023861  0.56886000  0.3977210502  0.39032580
## Fruits               1.000000000  0.54782217  0.5934310503  0.57160606
## Meat                 0.547822166  1.00000000  0.5735740153  0.53513611
## Fish                 0.593431050  0.57357402  1.0000000000  0.58386696
## Sweet                0.571606063  0.53513611  0.5838669550  1.00000000
## Gold                 0.396486924  0.35944628  0.4271420401  0.35744975
## NumDealsPurchases   -0.134512099 -0.12130771 -0.1432410856 -0.12143193
## NumWebPurchases      0.302038849  0.30709037  0.2996875104  0.33393722
## NumCatalogPurchases  0.486263071  0.73412660  0.5327567837  0.49513582
## NumStorePurchases    0.458491031  0.48600555  0.4577450432  0.45522516
## NumWebVisitsMonth   -0.418728932 -0.53948442 -0.4464232918 -0.42237080
## Complain            -0.005324099 -0.02378194 -0.0212202304 -0.02264120
## Response             0.122442679  0.23774642  0.1081451099  0.11617037
## Days_of_register     0.067957968  0.08957703  0.0818344445  0.08098580
## age                  0.014556100  0.02910979  0.0364573564  0.01593226
## Total_Spent          0.613248760  0.84588420  0.6423707848  0.60706155
## Children_Count      -0.395900937 -0.50454471 -0.4278407939 -0.38941066
## Family_Size         -0.341153564 -0.42959246 -0.3633413615 -0.33048169
## Is_Parent           -0.411962985 -0.57493054 -0.4503175284 -0.40272161
## Total_AcceptedCmp    0.157090225  0.30718976  0.1761107066  0.20148510
## Living_With_Partner -0.025987882 -0.02331328 -0.0179863528 -0.01594848
##                            Gold NumDealsPurchases NumWebPurchases
## Education           -0.09708394      0.0262078719     0.082425179
## Income               0.32591645     -0.0831008957     0.387877811
## Recency              0.01766264      0.0021154508    -0.005640854
## Wines                0.39273099      0.0088859288     0.553785939
## Fruits               0.39648692     -0.1345120994     0.302038849
## Meat                 0.35944628     -0.1213077141     0.307090366
## Fish                 0.42714204     -0.1432410856     0.299687510
## Sweet                0.35744975     -0.1214319277     0.333937217
## Gold                 1.00000000      0.0519048294     0.407065666
## NumDealsPurchases    0.05190483      1.0000000000     0.241440318
## NumWebPurchases      0.40706567      0.2414403183     1.000000000
## NumCatalogPurchases  0.44242825     -0.0121184280     0.386867640
## NumStorePurchases    0.38918017      0.0661065938     0.516240183
## NumWebVisitsMonth   -0.24769056      0.3460483800    -0.051226263
## Complain            -0.03113346      0.0004972467    -0.016641779
## Response             0.14033164      0.0034510733     0.151431233
## Days_of_register     0.16050486      0.2184982775     0.192762503
## age                  0.05580780      0.0479144530     0.142887919
## Total_Spent          0.52870784     -0.0658538502     0.528973336
## Children_Count      -0.26891799      0.4360758116    -0.148870808
## Family_Size         -0.23584611      0.3744637483    -0.121295776
## Is_Parent           -0.24743268      0.3884245096    -0.074007634
## Total_AcceptedCmp    0.19682192     -0.1257271367     0.196823395
## Living_With_Partner -0.02544210      0.0261473157     0.003454789
##                     NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Education                   0.069048763      0.0677920332      -0.040821622
## Income                      0.589162442      0.5293621403      -0.553088012
## Recency                     0.024081408     -0.0004338266      -0.018563643
## Wines                       0.634752741      0.6400119079      -0.321977901
## Fruits                      0.486263071      0.4584910315      -0.418728932
## Meat                        0.734126598      0.4860055453      -0.539484417
## Fish                        0.532756784      0.4577450432      -0.446423292
## Sweet                       0.495135818      0.4552251636      -0.422370804
## Gold                        0.442428252      0.3891801722      -0.247690557
## NumDealsPurchases          -0.012118428      0.0661065938       0.346048380
## NumWebPurchases             0.386867640      0.5162401827      -0.051226263
## NumCatalogPurchases         1.000000000      0.5178404511      -0.522003774
## NumStorePurchases           0.517840451      1.0000000000      -0.432398257
## NumWebVisitsMonth          -0.522003774     -0.4323982573       1.000000000
## Complain                   -0.020839191     -0.0169407070       0.019785006
## Response                    0.219913612      0.0362411292      -0.002208954
## Days_of_register            0.097610606      0.1126184473       0.276016404
## age                         0.116472244      0.1219227241      -0.137955315
## Total_Spent                 0.780481781      0.6751811033      -0.499081969
## Children_Count             -0.443473975     -0.3232127180       0.416076476
## Family_Size                -0.371974321     -0.2649185749       0.345705485
## Is_Parent                  -0.453470080     -0.2849271911       0.476234070
## Total_AcceptedCmp           0.346220700      0.2027434081      -0.165502004
## Living_With_Partner        -0.009854837      0.0045217548       0.003021724
##                          Complain     Response Days_of_register          age
## Education           -0.0508629608  0.090806018    -0.0489012830  0.173065089
## Income              -0.0272245123  0.133046664    -0.0185307766  0.162556855
## Recency              0.0136366703 -0.199766369     0.0259625870  0.014699814
## Wines               -0.0394702112  0.246298957     0.1680494850  0.150105920
## Fruits              -0.0053240986  0.122442679     0.0679579682  0.014556100
## Meat                -0.0237819441  0.237746418     0.0895770333  0.029109787
## Fish                -0.0212202304  0.108145110     0.0818344445  0.036457356
## Sweet               -0.0226412002  0.116170373     0.0809858011  0.015932264
## Gold                -0.0311334593  0.140331644     0.1605048638  0.055807800
## NumDealsPurchases    0.0004972467  0.003451073     0.2184982775  0.047914453
## NumWebPurchases     -0.0166417790  0.151431233     0.1927625029  0.142887919
## NumCatalogPurchases -0.0208391906  0.219913612     0.0976106064  0.116472244
## NumStorePurchases   -0.0169407070  0.036241129     0.1126184473  0.121922724
## NumWebVisitsMonth    0.0197850059 -0.002208954     0.2760164044 -0.137955315
## Complain             1.0000000000 -0.002029294     0.0332790490  0.028916787
## Response            -0.0020292937  1.000000000     0.1964793692 -0.033435733
## Days_of_register     0.0332790490  0.196479369     1.0000000000 -0.072164180
## age                  0.0289167872 -0.033435733    -0.0721641799  1.000000000
## Total_Spent         -0.0374276177  0.264127177     0.1585771047  0.104967891
## Children_Count       0.0317737462 -0.167648413    -0.0259413338  0.088891113
## Family_Size          0.0234414710 -0.217986778    -0.0271874589  0.072749508
## Is_Parent            0.0206080291 -0.203741851     0.0008881036 -0.012159163
## Total_AcceptedCmp   -0.0223880678  0.427123781    -0.0120094971  0.003565142
## Living_With_Partner -0.0053686489 -0.150161163    -0.0108512123 -0.001450542
##                     Total_Spent Children_Count Family_Size     Is_Parent
## Education            0.09406922     0.05473709  0.03757485  0.0234413637
## Income               0.66757609    -0.29335192 -0.24014788 -0.3381534132
## Recency              0.02006565     0.01829009  0.01440213  0.0024851549
## Wines                0.89313572    -0.35374765 -0.29638817 -0.3430941150
## Fruits               0.61324876    -0.39590094 -0.34115356 -0.4119629845
## Meat                 0.84588420    -0.50454471 -0.42959246 -0.5749305420
## Fish                 0.64237078    -0.42784079 -0.36334136 -0.4503175284
## Sweet                0.60706155    -0.38941066 -0.33048169 -0.4027216054
## Gold                 0.52870784    -0.26891799 -0.23584611 -0.2474326815
## NumDealsPurchases   -0.06585385     0.43607581  0.37446375  0.3884245096
## NumWebPurchases      0.52897334    -0.14887081 -0.12129578 -0.0740076339
## NumCatalogPurchases  0.78048178    -0.44347397 -0.37197432 -0.4534700798
## NumStorePurchases    0.67518110    -0.32321272 -0.26491857 -0.2849271911
## NumWebVisitsMonth   -0.49908197     0.41607648  0.34570548  0.4762340699
## Complain            -0.03742762     0.03177375  0.02344147  0.0206080291
## Response             0.26412718    -0.16764841 -0.21798678 -0.2037418514
## Days_of_register     0.15857710    -0.02594133 -0.02718746  0.0008881036
## age                  0.10496789     0.08889111  0.07274951 -0.0121591625
## Total_Spent          1.00000000    -0.50024427 -0.42394055 -0.5226292071
## Children_Count      -0.50024427     1.00000000  0.84932612  0.7998054843
## Family_Size         -0.42394055     0.84932612  1.00000000  0.6917811150
## Is_Parent           -0.52262921     0.79980548  0.69178111  1.0000000000
## Total_AcceptedCmp    0.45709549    -0.24589113 -0.20302825 -0.2797884829
## Living_With_Partner -0.01934746     0.04219466  0.56323540  0.0573789899
##                     Total_AcceptedCmp Living_With_Partner
## Education                0.0376044410       -0.0145636347
## Income                   0.3083810900        0.0046633983
## Recency                 -0.0136471809       -0.0013710486
## Wines                    0.5108325594       -0.0072437882
## Fruits                   0.1570902246       -0.0259878817
## Meat                     0.3071897637       -0.0233132823
## Fish                     0.1761107066       -0.0179863528
## Sweet                    0.2014851017       -0.0159484810
## Gold                     0.1968219238       -0.0254420990
## NumDealsPurchases       -0.1257271367        0.0261473157
## NumWebPurchases          0.1968233948        0.0034547889
## NumCatalogPurchases      0.3462206999       -0.0098548365
## NumStorePurchases        0.2027434081        0.0045217548
## NumWebVisitsMonth       -0.1655020044        0.0030217241
## Complain                -0.0223880678       -0.0053686489
## Response                 0.4271237810       -0.1501611634
## Days_of_register        -0.0120094971       -0.0108512123
## age                      0.0035651417       -0.0014505421
## Total_Spent              0.4570954896       -0.0193474565
## Children_Count          -0.2458911345        0.0421946621
## Family_Size             -0.2030282527        0.5632354034
## Is_Parent               -0.2797884829        0.0573789899
## Total_AcceptedCmp        1.0000000000        0.0006280779
## Living_With_Partner      0.0006280779        1.0000000000
rho <- cortest.bartlett((correl_matrix))
## Warning in cortest.bartlett((correl_matrix)): n not specified, 100 used
rho
## $chisq
## [1] 7558.289
## 
## $p.value
## [1] 0
## 
## $df
## [1] 276


For the current dataset, X²Bartlett = 7558.289 with 276 degrees of freedom and alpha = 0.05, giving p-value = 0; therefore this dataset is suitable for a data reduction technique.
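The warning above appeared because cortest.bartlett() assumes n = 100 when it receives a correlation matrix. A sketch of the call with the real sample size (assuming the objects created above are still in the session):

```r
library(psych)

# Supplying the true sample size avoids the "n not specified, 100 used"
# warning and puts the chi-square statistic on the correct scale.
rho <- cortest.bartlett(correl_matrix, n = nrow(market_campaign))
rho$p.value   # still ~0 for this dataset, so data reduction remains justified
```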

PCA

# Factors
fact_1 <- prcomp(market_campaign_std)
fviz_pca_var(fact_1, col.var = "steelblue")


The cumulative variance of the first two principal components is only 0.443, so more principal components may be needed to explain enough variance. The Kaiser Criterion will be used to determine the adequate number of PCs.

# Eigenvalues
eigenvalues <- round(as.numeric(fact_1$sdev^2))
print(k <- sum(eigenvalues))
## [1] 21
shared_variance <- as.data.frame (eigenvalues/k) %>% 
  slice(1:26) %>%
  melt()%>%
  mutate(PC = paste0("PC", row_number())) %>%
  rename(Shared_Variance = 1)
## No id variables; using all as measure variables
shared_variance %>%
  melt()%>%
  ggplot(aes(x = PC, y = value, fill = variable))+
  geom_col(col= "grey30", fill = "grey39")+
  geom_text(aes(label =  paste0(round(value * 100, 2), "%")), col = "black", vjust = -0.3, size = 2)+
  labs(x = "PC", y = "Shared Variance") +
  theme_gray(base_size =8)
## Using Shared_Variance, PC as id variables


At this step the sum of the (rounded) eigenvalues is 21, and we still have too many components to differentiate. The chart also shows that 11 PCs contribute very little to the variance. A group of variables represented by a factor whose eigenvalue is smaller than 1 is usually not representing the behaviour of an original variable; exceptions are rare, and usually occur for values smaller than but near 1. Choosing the number of components with eigenvalues > 1 is known as the Kaiser Criterion (a.k.a. Latent Root Criterion).

# Kaiser Criterion
k <- sum(eigenvalues > 1)
print(k)
## [1] 3


Three eigenvalues remain, so according to the criterion three principal components are selected: the three that contribute most to the shared variance. Therefore, PC1, PC2 and PC3 will be kept.
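As a cross-check on the Kaiser result, the explained variance can also be inspected with a scree plot (a sketch using factoextra, which is already in use above):

```r
# Scree plot: percentage of variance explained by each component
fviz_eig(fact_1, addlabels = TRUE, ncp = 10)

# Cumulative proportion of variance retained by the first three PCs
summary(fact_1)$importance["Cumulative Proportion", 1:3]
```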


The next step is to evaluate which variables constitute the major part of each PC.

# Running the PCA for 3 factors
# Contributions for PC1
var <- get_pca_var(fact_1) #variable extraction
a <- fviz_contrib(fact_1, "var", axes = 1, xtickslab.rt = 90)
print(plot(a, main = "Variables' percentage contribution to the first Principal Component"))


For PC1, the variables that contribute most to the component are Meat, Wines, Fish, Income, Total_Spent, NumCatalogPurchases, Fruits, Sweet, NumWebVisitsMonth, NumStorePurchases, Family_Size and Children_Count.

# Contributions for PC2
b <- fviz_contrib(fact_1, "var", axes = 2, xtickslab.rt = 90)
print(plot(b, main = "Variables' percentage contribution to the second Principal Component"))


For PC2, the main contributors are NumDealsPurchases, Family_Size, NumWebPurchases, Children_Count, Is_Parent, Wines and Days_of_register.

# Contributions for PC3
c <- fviz_contrib(fact_1, "var", axes = 3, xtickslab.rt = 90)
print(plot(c, main = "Variables' percentage contribution to the third Principal Component"))


For PC3, the biggest contributors are Response, Days_of_register, NumWebVisitsMonth, Living_With_Partner, Family_Size and Total_AcceptedCmp.

Cluster Tendencies According to PCA


Time to run a scatterplot of the PCs with k = 3 and apply the Hopkins test to see whether the dataset has a tendency to form clusters. Why do this? Because a big issue with clustering methods is that they will return clusters even if the data does not contain any.


The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a data set by measuring the probability that a given data set is generated by uniform data distribution. In other words, it tests the spatial randomness of the data.

  • H0: the data set D is uniformly distributed (i.e., no meaningful clusters)
  • H1: the data set D is not uniformly distributed (i.e., contains meaningful clusters)


We can conduct the Hopkins statistic test using 0.5 as the threshold: if H < 0.5, it is unlikely that D has statistically significant clusters. Put another way, if the value of the Hopkins statistic is close to 1, we can reject the null hypothesis and conclude that dataset D is significantly clusterable.

pca2 <- prcomp(market_campaign_std, center = FALSE, scale. = FALSE, rank. = 3)
results <- as.data.frame(pca2$x)

print(hop_stat <- clustertend:::hopkins(results, n = ceiling(nrow(results)/10)))
## Warning: Package `clustertend` is deprecated.  Use package `hopkins` instead.
## $H
## [1] 0.1859002


Specifically for the clustertend package, the hopkins() function outputs 1 - Hopkins statistic, so the smaller the value, the better the chances of clusters. This means Hstat = 1 - 0.1859 = 0.8141. As the Hopkins statistic is close to 1, we can reject the null hypothesis and conclude that the dataset is significantly clusterable.
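Since the clustertend package is deprecated, the same check can be repeated with its replacement, the hopkins package, whose hopkins() function returns the statistic H directly rather than 1 - H (a sketch, assuming the package is installed and `results` is still in the session):

```r
library(hopkins)

set.seed(123)
# hopkins::hopkins() reports H itself: values near 1 indicate clusterable data
H <- hopkins(results, m = ceiling(nrow(results) / 10))
H   # expected to land near 1 - 0.1859 = 0.8141 for this dataset
```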

# Create a green-to-blue color palette
color_palette <- brewer.pal(9, "GnBu")

# Create a vector of colors with the same length as the data
colors <- rep(color_palette, length.out = nrow(results))

# Create the scatterplot with the specified color palette
scatterplot3d(results$PC1, results$PC2, results$PC3,
              pch = 16, main = "3D Scatter Plot",
              xlab = "PC1", ylab = "PC2", zlab = "PC3",
              color = colors)


At this point it is not yet possible to differentiate any clusters.


Now that we have reduced the dimensions to 3 using the magic of “DimensioNimbus” and “PCAtronus”, we can jump to Clustering.
Call me Hermione DataChangers 🧙️

Clustering

# Elbow Method using results from pca2 as input
fviz_nbclust(results, kmeans, method ="wss", k.max = 20) +
  geom_vline(xintercept = 4, linetype = "dashed", color = "red" )


By the elbow chart, the optimal number of clusters (k) is 4; that is where the slope flattens. Let’s fit the k-means clustering model to get the final clusters.
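The elbow choice can be cross-checked with the average silhouette method, reusing the same fviz_nbclust() helper (a sketch; the peak of the curve suggests the best k):

```r
# Average silhouette width for each candidate k; higher is better
fviz_nbclust(results, kmeans, method = "silhouette", k.max = 10)
```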

Hierarchical Clustering

# Distance matrix and complete-linkage hierarchical tree
dm <- dist(results)
hc <- hclust(dm, method = "complete")
coeficientes <- sort(hc$height, decreasing = FALSE)  # fusion heights
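The complete-linkage tree built above can be visualized as a dendrogram cut into the four groups suggested by the elbow method (a factoextra sketch):

```r
# Dendrogram of the complete-linkage tree, cut into k = 4 groups
fviz_dend(hc, k = 4, show_labels = FALSE, rect = TRUE)
```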

K-means


The next step is to choose the most suitable distance metric. Clustering will be conducted for two different distance measures, specifically:

  • Euclidean distance
  • Manhattan distance


Only these two, given the data characteristics (continuous, with negative values after standardization).
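Base kmeans() only minimizes Euclidean distances, so the Manhattan variant can be approximated with k-medoids via cluster::pam(), which accepts metric = "manhattan" (a sketch for comparison, assuming the cluster package is available and `results` is in the session):

```r
library(cluster)

# k-medoids with Manhattan distance, as a robustness check against k-means
pam_model <- pam(results, k = 4, metric = "manhattan")

# Average silhouette width, directly comparable with the k-means silhouettes
pam_model$silinfo$avg.width
```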

# Generating the model
kmeans_model <- eclust(results, "kmeans", hc_metric = "euclidean", k = 4)

fviz_silhouette(kmeans_model)
##   cluster size ave.sil.width
## 1       1  542          0.27
## 2       2  421          0.22
## 3       3  515          0.42
## 4       4  738          0.38

Clustering Raw Data

km1 <- eclust(market_campaign_std, "kmeans", hc_metric = "euclidean", k = 4)

fviz_silhouette(km1)
##   cluster size ave.sil.width
## 1       1  552          0.06
## 2       2  374          0.10
## 3       3  504          0.12
## 4       4  786          0.22


Analysing the above results, clustering on the raw data clearly shows a smaller average silhouette width. Thus, the PCA step definitely helped and improved the final result of the k-means clustering.

data <- as.data.frame(kmeans_model$data)
cluster_labels <- kmeans_model$cluster

scatterplot3d(data$PC1, data$PC2, data$PC3,
              color = cluster_labels, pch = 16, main = "3D Scatter Plot",
              xlab = "PC1", ylab = "PC2", zlab = "PC3")

Evaluating the Models

cluster_rank <- as.data.frame(kmeans_model$cluster)
colnames(cluster_rank) <- "cluster"
cluster_rank %>%
  ggplot(aes(x = cluster, fill = factor(cluster))) +
  geom_bar() + 
  labs(title = "Distribution of the Clusters") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        legend.position = "none")


The clusters seem to be fairly distributed. Let’s evaluate the cluster behavior in relation of Total_Spent and Income.

market_campaign$cluster <- kmeans_model$cluster
market_campaign <- as.data.frame(market_campaign)
market_campaign %>%
  ggplot(aes(x = Total_Spent , y = Income, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Based On Income And Spending") +
  theme_gray()


The income vs spending plot shows the cluster patterns:

  • group 1: high spending & average income
  • group 2: low spending & low income
  • group 3: high spending & high income
  • group 4: low spending & low income

create_component_plots <- function(cluster_num) {
  # Note: each plot below shows all clusters at once, colored by cluster;
  # the cluster_num argument only drives the lapply() iteration.

  # Create individual component plots
  plot1 <- ggplot(market_campaign, aes(x = Wines, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Wine") +
  theme_gray()
  
  plot2 <- ggplot(market_campaign, aes(x = Fruits, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Fruits") +
  theme_gray()
  
  plot3 <- ggplot(market_campaign, aes(x = Meat, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Meat") +
  theme_gray()
  
  plot4 <- ggplot(market_campaign, aes(x = Fish, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Fish") +
  theme_gray()
  
  plot5 <- ggplot(market_campaign, aes(x = Sweet, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Sweets") +
  theme_gray()
  
  plot6 <- ggplot(market_campaign, aes(x = Gold, y = Total_Spent, color = factor(cluster))) +
  geom_point() +
  scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
  labs(title = "Cluster's Profile Gold") +
  theme_gray()
  
  # Combine the component plots using grid.arrange
  grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2)
}

# Create a grid of subplots for each cluster
plot_grid <- lapply(unique(market_campaign$cluster), create_component_plots)

# Total_AcceptedCmp is the total number of offers accepted across campaigns

market_campaign%>%
  ggplot(aes(x = Total_AcceptedCmp, fill = factor(cluster))) +
  geom_bar() +
  labs( x = "Number of Offers Accepted", y = "Total Count")+
  theme_gray()

There has not been an overwhelming response to the campaigns so far; very few participants overall. Moreover, no one took part in all five of them. Perhaps better-targeted and well-planned campaigns are required to boost sales.

market_campaign %>%
  ggplot(aes(y  = Total_Spent, x= cluster, fill = factor(cluster))) +
  geom_boxplot()+
  theme_grey()

From the above plot, it can be clearly seen that cluster 3 is our biggest set of customers in terms of expenditure. We can explore what each cluster spends on to design targeted marketing strategies.

market_campaign %>%
  ggplot(aes(y  = NumDealsPurchases, x= cluster, fill = factor(cluster))) +
  geom_boxplot()+
  theme_grey()

Unlike the campaigns, the deals offered had their best outcome with clusters 1 and 4, but not with our star customers in cluster 3. Nothing seems to attract cluster 2 overwhelmingly.

PROFILING


Profiling involves generating descriptions of the clusters with reference to the input variables you used for the cluster analysis. Profiling acts as a class descriptor for the clusters and will help you to ‘tell a story’ so that you can understand this information and use it across your business.

To do that, I will plot some features indicative of the customers’ personal traits in light of the cluster they are in, and draw conclusions from the outcomes.
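Before plotting, a compact numeric profile per cluster can be tabulated in one pass (a dplyr sketch using the columns created earlier):

```r
# One row per cluster: the averages that drive the profiling plots
market_campaign %>%
  group_by(cluster) %>%
  summarise(across(c(Income, Total_Spent, Children_Count, Family_Size,
                     age, Days_of_register), mean),
            n = n())
```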

grouped_data <- market_campaign %>%
  group_by(cluster) %>%
  summarise(avg_children = mean(Children_Count), avg_fam = mean(Family_Size) )

plot_list <- lapply(unique(grouped_data$cluster), function(cl) {
  data <- subset(grouped_data, cluster == cl)  # avoid masking the column name
  ggplot(data, aes( x = cluster , y = avg_children, fill = factor(cluster))) +
    geom_col(position = "stack") +
    labs(y = "Average Children Count", title = "Average Children Count by Cluster") +
    theme_gray() +
    facet_wrap(~ cluster, ncol = 2)
})
plot_list
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

pl2 <- lapply(unique(grouped_data$cluster), function(cl) {
  data <- subset(grouped_data, cluster == cl)  # avoid masking the column name
  ggplot(data, aes( x = cluster , y = avg_fam, fill = factor(cluster))) +
    geom_col(position = "stack") +
    labs(y = "Average Family Size", title = "Average Family Size by Cluster") +
    theme_gray() +
    facet_wrap(~ cluster, ncol = 2)
})

pl2
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

customer_data <- c("Is_Parent", "Living_With_Partner")
pl3 <- lapply(customer_data, function(var) {
  ggplot(market_campaign, aes( y = factor(.data[[var]]),  fill = factor(cluster))) +
    geom_bar() +
    labs(x = "Count", y = var)+
    theme_gray()
})
grid.arrange(grobs = pl3, ncol = 1)

market_campaign_1 <- market_campaign %>% filter(cluster == 1)
plot_data_1 <- market_campaign_1 %>%
  ggplot(aes(y = age)) +
  geom_bar(fill = "pink") +
  labs(title = "Age Distribution for Cluster 1")+
  theme_gray()

market_campaign_2 <- market_campaign %>% filter(cluster == 2)
plot_data_2 <- market_campaign_2 %>%
  ggplot(aes(y = age)) +
  geom_bar(fill = "purple") +
  labs(title = "Age Distribution for Cluster 2")+
  theme_gray()

market_campaign_3 <- market_campaign %>% filter(cluster == 3)
plot_data_3 <- market_campaign_3 %>%
  ggplot(aes(y = age)) +
  geom_bar(fill = "darkblue") +
  labs(title = "Age Distribution for Cluster 3")+
  theme_gray()

market_campaign_4 <- market_campaign %>% filter(cluster == 4)
plot_data_4 <- market_campaign_4 %>%
  ggplot(aes(y = age)) +
  geom_bar(fill = "yellow") +
  labs(title = "Age Distribution for Cluster 4")+
  theme_gray()

grid.arrange(plot_data_1, plot_data_2, plot_data_3, plot_data_4, ncol = 2)

grouped_data_4 <- market_campaign %>%
  group_by(cluster) %>%
  summarise(avg_days = mean(Days_of_register))

plot_data_21 <- grouped_data_4 %>%
  group_by(cluster)%>%
  ggplot(aes(y = avg_days, x = cluster, fill = factor(cluster))) +
  geom_col() +
  theme_gray()
plot_data_21

plots_1 <- market_campaign %>%
  mutate(cluster = as.factor(cluster)) %>%
  ggplot() +
  geom_density_2d(aes(x = Education, y = Total_Spent, color = cluster, group = cluster), alpha = 0.5, na.rm = FALSE) +
  scale_color_manual(values = c("deeppink", "green3", "dodgerblue2", "darkorchid1")) +
  labs(x = "Education", y = "Total Spent", title = "Educational Profile vs. Total Spent") +
  theme_bw()
plots_1


Cluster 1:

  • Average Children Count: second in this criterion, surpassed only by cluster 4.
  • Average Family Size: maximum of 4 members; again second only to cluster 4.
  • Living With a Partner: single parents are a subset of this group, but the Together status predominates.
  • Age: most consumers are between 30-65 y.o.
  • Days_of_register: contains the consumers registered for the longest period.


Cluster 2:

  • Family Size: maximum value 3.
  • Is_Parent: the majority of consumers in this cluster are parents.
  • Living With a Partner: single parents are a subset of this group.
  • Age: relatively younger, concentrated between 20-45 y.o.


Cluster 3:

  • Average Children Count: smallest children count.
  • Is_Parent: mainly composed of non-parents.
  • Average Family Size: smallest family sizes, at most 2 members.
  • Living With a Partner: a slight majority living as a couple rather than single.
  • Age: more dispersed, 30-60 y.o.
  • Education Level: among consumers with education level 3, this is the cluster that tends to spend the most.


Cluster 4:

  • Is_Parent: yes, the majority.
  • Average Children Count: biggest among all clusters.
  • Average Family Size: biggest on average, with family size between 2 and 5 members.
  • Living With a Partner: yes, for the majority.
  • Age: concentrated between 25-70 y.o.
  • Days_of_register: smallest, around 300 days.
  • Income: lower-income group.

CONCLUSION

In this project, I performed unsupervised clustering: dimensionality reduction using PCA, followed by k-means. The final output consisted of 4 clusters, which were used to profile customers according to their family structure and income/spending. This can be used to plan better marketing strategies.

Bibliography

Hill, N. and Alexander, J. (2017), The Handbook of Customer Satisfaction and Loyalty Measurement, Routledge.

Ivens, Bjoern, and Katharina S. Valta. “Customer Brand Personality Perception: A Taxonomic Analysis.” Journal of Marketing Management, vol. 28, no. 9-10, Taylor & Francis, 2012, pp. 1062-93, doi:10.1080/0267257X.2011.615149.